471 research outputs found

    Current challenges in de novo plant genome sequencing and assembly

    ABSTRACT: Genome sequencing is now affordable, but assembling plant genomes de novo remains challenging. We assess the state of the art of assembly and review the best practices for the community.

    Sequencing the maize genome

    Sequencing of complex genomes can be accomplished by enriching shotgun libraries for genes. In maize, gene enrichment by copy-number normalization (high C(0)t) and methylation filtration (MF) has been used to generate up to two-fold coverage of the gene space with fewer than 1 million sequencing reads. Simulations using sequenced bacterial artificial chromosome (BAC) clones predict that 5x coverage of gene-rich regions, accompanied by less than 1x coverage of subclones from BAC contigs, will generate high-quality mapped sequence that meets the needs of geneticists while accommodating unusually high levels of structural polymorphism. We propose a strategy for capturing this polymorphism by sequencing several inbred strains, in order to investigate hybrid vigor, or heterosis.

    Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome

    Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, and we used this for sequencing the Saccharomyces cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm, Nanocorr, specifically for Oxford Nanopore reads, because existing packages were incapable of assembling the long read lengths (5-50 kbp) at such high error rates (between approximately 5% and 40% error). With this new method, we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: the contig N50 length is more than ten times greater than that of an Illumina-only assembly (678 kbp versus 59.9 kbp), and the assembly has >99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.
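    The contig N50 quoted above is a standard contiguity statistic: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The following is a minimal sketch of that definition, not code from the Nanocorr project; the function name `n50` and the toy lengths are assumptions made for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running sum first reaches at least half
    of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy lengths (made up, in kbp); total is 290, so half is 145.
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)  # 80 + 70 = 150 >= 145, so the N50 is 70
```

    Because the statistic depends only on the sorted lengths, a few long contigs raise the N50 sharply, which is why long-read assemblies tend to score much higher than short-read ones.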

    Validation and assessment of variant calling pipelines for next-generation sequencing

    Background: The processing and analysis of the large-scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. Here, we validate the variant calling of these tools and compare their relative accuracy to determine which data processing pipeline is optimal. Results: We developed a unified pipeline for processing NGS data that encompasses four modules: mapping, filtering, realignment and recalibration, and variant calling. We processed 130 subjects from an ongoing whole exome sequencing study through this pipeline. To evaluate the accuracy of each module, we conducted a series of comparisons between the single nucleotide variant (SNV) calls from the NGS data and either gold-standard Sanger sequencing on a total of 700 variants or array genotyping data on a total of 9,935 single-nucleotide polymorphisms. A head-to-head comparison showed that the Genome Analysis Toolkit (GATK) provided more accurate calls than SAMtools (positive predictive value of 92.55% vs. 80.35%, respectively). Realignment of mapped reads and recalibration of base quality scores before SNV calling proved to be crucial to accurate variant calling. The GATK HaplotypeCaller algorithm for variant calling outperformed the UnifiedGenotyper algorithm. We also showed a relationship between mapping quality, read depth and allele balance, and SNV call accuracy. However, if best practices are used in data processing, then additional filtering based on these metrics provides little gain, and accuracies of >99% are achievable. Conclusions: Our findings will help to determine the best approach for processing NGS data to confidently call variants for downstream analyses. To enable others to implement and replicate our results, all of our code is freely available at http://metamoodics.org/wes
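    The positive predictive value (PPV) used to compare the callers above is the fraction of called variants that are confirmed by the gold standard. The sketch below only illustrates that ratio; it is not the authors' pipeline code, and the function name and the counts in the example are hypothetical.

```python
def positive_predictive_value(true_positives, false_positives):
    """Return PPV = TP / (TP + FP) for a set of variant calls.

    true_positives:  calls confirmed by the gold standard
    false_positives: calls contradicted by the gold standard
    """
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("no variant calls to evaluate")
    return true_positives / called


# Hypothetical counts, not taken from the study: 93 of 100 calls confirmed.
example_ppv = positive_predictive_value(true_positives=93, false_positives=7)
```

    PPV says nothing about variants the caller missed, so it is usually read alongside sensitivity when comparing pipelines.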

    Parallel comparison of Illumina RNA-Seq and Affymetrix microarray platforms on transcriptomic profiles generated from 5-aza-deoxy-cytidine treated HT-29 colon cancer cells and simulated datasets

    BACKGROUND: High-throughput parallel sequencing, RNA-Seq, has recently emerged as an appealing alternative to microarrays for identifying differentially expressed genes (DEG) between biological groups. However, considerable discrepancies in gene expression measurements and DEG results still exist between the two platforms. The objective of this study was to compare parallel paired-end RNA-Seq and microarray data generated on 5-aza-deoxy-cytidine (5-Aza) treated HT-29 colon cancer cells, with an additional simulation study. METHODS: We first performed general correlation analysis comparing gene expression profiles on both platforms. An Errors-In-Variables (EIV) regression model was subsequently applied to assess proportional and fixed biases between the two technologies. Then several existing algorithms, designed for DEG identification in RNA-Seq and microarray data, were applied to compare the cross-platform overlaps with respect to DEG lists, which were further validated using qRT-PCR assays on selected genes. Functional analyses were subsequently conducted using Ingenuity Pathway Analysis (IPA). RESULTS: Pearson and Spearman correlation coefficients between the RNA-Seq and microarray data each exceeded 0.80, with 66% to 68% overlap of genes on both platforms. The EIV regression model indicated the existence of both fixed and proportional biases between the two platforms. The DESeq and baySeq algorithms (RNA-Seq) and the SAM and eBayes algorithms (microarray) achieved the highest cross-platform overlap rate in DEG results from both experimental and simulated datasets. The DESeq method exhibited better control of the false discovery rate than baySeq on the simulated dataset, although it performed slightly worse than baySeq in the sensitivity test. RNA-Seq and qRT-PCR, but not microarray data, confirmed the expected reversal of SPARC gene suppression after treating HT-29 cells with 5-Aza. Thirty-three IPA canonical pathways were identified by both microarray and RNA-Seq data, 152 pathways by RNA-Seq data only, and none by microarray data only. CONCLUSIONS: These results suggest that RNA-Seq has advantages over microarray in identification of DEGs, with the most consistent results generated from the DESeq and SAM methods. The EIV regression model reveals both fixed and proportional biases between RNA-Seq and microarray. This may explain in part the lower cross-platform overlap in DEG lists compared to that in detectable genes.

    Integrated RNA-seq and sRNA-seq analysis identifies novel nitrate-responsive genes in Arabidopsis thaliana roots

    Background: Nitrate and other nitrogen metabolites can act as signals that regulate global gene expression in plants. Adaptive changes in plant morphology and physiology triggered by changes in nitrate availability are partly explained by these changes in gene expression. Despite several genome-wide efforts to identify nitrate-regulated genes, no comprehensive study of the Arabidopsis root transcriptome under contrasting nitrate conditions has been carried out. Results: In this work, we employed the Illumina high-throughput sequencing technology to perform an integrated analysis of the poly-A+ enriched and the small RNA fractions of the Arabidopsis thaliana root transcriptome in response to nitrate treatments. Our sequencing strategy identified new nitrate-regulated genes, including 40 genes not represented in the ATH1 Affymetrix GeneChip, a novel nitrate-responsive antisense transcript, and a new nitrate-responsive miRNA/TARGET module consisting of a novel microRNA, miR5640, and its target, AtPPC3. Conclusions: Sequencing of small RNAs and mRNAs uncovered new genes, and enabled us to develop new hypotheses for nitrate regulation and coordination of carbon and nitrogen metabolism.

    Syntenic relationships between Medicago truncatula and Arabidopsis reveal extensive divergence of genome organization

    Arabidopsis and Medicago truncatula represent sister clades within the dicot subclass Rosidae. We used genetic map-based and bacterial artificial chromosome sequence-based approaches to estimate the level of synteny between the genomes of these model plant species. Mapping of 82 tentative orthologous gene pairs reveals a lack of extended macrosynteny between the two genomes, although marker collinearity is frequently observed over small genetic intervals. Divergence estimates based on non-synonymous nucleotide substitutions suggest that a majority of the genes under analysis have experienced duplication in Arabidopsis subsequent to divergence of the two genomes, potentially confounding synteny analysis. Moreover, in cases of localized synteny, genetically linked loci in M. truncatula often share multiple points of synteny with Arabidopsis; this latter observation is consistent with the large number of segmental duplications that compose the Arabidopsis genome. More detailed analysis, based on complete sequencing and annotation of three M. truncatula bacterial artificial chromosome contigs, suggests that the two genomes are related by networks of microsynteny that are often highly degenerate. In some cases, the erosion of microsynteny could be ascribed to selective gene loss from duplicated loci, whereas in other cases, it is due to the absence of close homologs of M. truncatula genes in Arabidopsis.

    Two waves of de novo methylation during mouse germ cell development

    During development, mammalian germ cells reprogram their epigenomes via a genome-wide erasure and de novo rewriting of DNA methylation marks. We know little of how methylation patterns are specifically determined. The piRNA pathway is thought to target the bulk of retrotransposon methylation. Here we show that most retrotransposon sequences are modified by default de novo methylation. However, potentially active retrotransposon copies evade this initial wave, likely mimicking features of protein-coding genes. These elements remain transcriptionally active and become targets of piRNA-mediated methylation. Thus, we posit that these two waves play essential roles in resetting germ cell epigenomes at each generation.

    A Hybrid Likelihood Model for Sequence-Based Disease Association Studies

    In the past few years, case-control studies of common diseases have shifted their focus from single genes to whole exomes. New sequencing technologies now routinely detect hundreds of thousands of sequence variants in a single study, many of which are rare or even novel. The limitation of classical single-marker association analysis for rare variants has been a challenge in such studies. A new generation of statistical methods for case-control association studies has been developed to meet this challenge. A common approach to association analysis of rare variants is to use burden-style collapsing methods, which combine rare variant data within individuals across or within genes. Here, we propose a new hybrid likelihood model that combines a burden test with a test of the position distribution of variants. In extensive simulations and on empirical data from the Dallas Heart Study, the new model demonstrates consistently good power, in particular when applied to a gene set (e.g., multiple candidate genes with shared biological function or pathway), when rare variants cluster in key functional regions of a gene, and when protective variants are present. When applied to data from an ongoing sequencing study of bipolar disorder (191 cases, 107 controls), the model identifies seven gene sets with nominal p-values < 0.05, of which one, the MAPK signaling pathway (KEGG), reaches trend-level significance after correcting for multiple testing. © 2013 Chen et al.
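    The burden-style collapsing mentioned above can be pictured as reducing each individual's rare-variant genotypes in a gene to a single score. The sketch below shows only that collapsing step on made-up data; it is not the hybrid likelihood model proposed in the paper, and the function names, genotype coding (minor-allele counts of 0, 1, or 2), and toy values are all assumptions.

```python
def collapse_burden(genotypes, rare_variant_indices):
    """Collapse rare variants into one burden score per individual.

    genotypes: one list per individual of minor-allele counts (0, 1, 2),
               with one entry per variant site in the gene.
    rare_variant_indices: positions of the sites treated as rare.
    """
    return [sum(row[i] for i in rare_variant_indices) for row in genotypes]


def carrier_fraction(burdens):
    """Fraction of individuals carrying at least one rare allele."""
    if not burdens:
        raise ValueError("need at least one individual")
    return sum(1 for b in burdens if b > 0) / len(burdens)


# Toy data (made up): three variant sites, of which sites 0 and 1 are rare.
cases = [[0, 1, 0], [1, 0, 2], [0, 0, 0]]
controls = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
rare_sites = [0, 1]

case_burdens = collapse_burden(cases, rare_sites)        # [1, 1, 0]
control_burdens = collapse_burden(controls, rare_sites)  # [0, 0, 0]
```

    A burden test would then compare these scores (or the carrier fractions) between cases and controls; because the scores simply add up alleles, protective and risk variants in the same gene can cancel out, which is one of the situations the abstract says the hybrid model handles better.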